Adaptive Proximal Average Approximation for Composite Convex Minimization
Authors
Abstract
We propose a fast first-order method to solve multi-term nonsmooth composite convex minimization problems by employing a recent proximal average approximation technique and a novel adaptive parameter tuning technique. Thanks to this parameter tuning technique, the proximal gradient step can be performed with a much larger stepsize than in the prior PA-APG method (Yu 2013), which is the key to the significant improvements in practical performance. Moreover, by choosing the approximation parameter adaptively, the proposed method is shown to enjoy the $O(1/K)$ iteration complexity without any extra computational cost, whereas the PA-APG method requires many more iterations to converge. Preliminary experimental results on overlapping group Lasso and graph-guided fused Lasso problems confirm our theoretical claims and indicate that the proposed method is almost five times faster than the state-of-the-art PA-APG method, making it well suited to optimization tasks that require higher precision.

Introduction

Let $\mathbb{X}$ be a finite-dimensional linear space endowed with the inner product $\langle \cdot, \cdot \rangle$ and its induced norm $\|\cdot\|$. We are interested in solving the following multi-term nonsmooth composite convex minimization problem
$$F^* := \min_{x \in \mathbb{X}} F(x) = f(x) + g(x) \qquad (1)$$
with $g(x) = \sum_{i=1}^N \alpha_i g_i(x)$, where $\alpha_i \ge 0$ satisfy $\sum_{i=1}^N \alpha_i = 1$, each $g_i : \mathbb{X} \to [-\infty, +\infty]$ is a proper, closed convex function, and $f : \mathbb{X} \to (-\infty, +\infty)$ is a continuously differentiable convex function whose gradient is Lipschitz continuous with modulus $L_f$, i.e.,
$$\|\nabla f(x) - \nabla f(y)\| \le L_f \|x - y\|, \qquad \forall x, y \in \mathbb{X}.$$
Moreover, we assume that $Q_i = \operatorname{dom} g_i^*$ is a bounded convex set for all $i = 1, \cdots, N$, where $g_i^*$ denotes the Fenchel conjugate of $g_i$, defined by
$$g_i^*(x) = \sup_{u_i} \{ \langle u_i, x \rangle - g_i(u_i) \}. \qquad (2)$$
Notice that the boundedness assumption on $Q_i$ is equivalent to the global Lipschitz continuity of $g_i$ used in (Yu 2013, Assumption 1), according to (Borwein and Vanderwerff 2010, Proposition 4.4.6). For instance, when $g_i = \|\cdot\|_1$, its conjugate is the indicator function of the unit $\ell_\infty$ ball, which is bounded, and $\|\cdot\|_1$ is indeed globally Lipschitz continuous.

The multi-term nonsmooth composite convex minimization problem (1) covers a large number of important applications in machine learning, such as overlapping group Lasso (Zhao, Rocha, and Yu 2009; Mairal et al. 2010), graph-guided fused Lasso (Chen et al. 2012; Kim and Xing 2009), graph-guided logistic regression (Ouyang et al. 2013), and other types of regularized risk minimization problems (Teo et al. 2010). The regularization term $g(x) = \sum_{i=1}^N \alpha_i g_i(x)$ often encodes important structural information about the problem or the data, such as structured sparsity (Bach et al. 2011; 2012) and nonnegativity. However, the multiple nonsmooth components make problem (1) difficult to solve even when $N$ is small. For the special cases $N = 0$ and $N = 1$, the most popular first-order methods are the accelerated gradient-type methods, which enjoy the optimal $O(1/K^2)$ iteration complexity (Nesterov 2013b); they were first proposed by Nesterov (Nesterov 1983) for $N = 0$ and later popularized for $N = 1$ by Beck and Teboulle (Beck and Teboulle 2009a) and Nesterov (Nesterov 2013a). Beck and Teboulle's method is called "FISTA", while Nesterov's method in (Nesterov 2013a) is called "APG".
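As a concrete point of reference for the $N = 1$ case just discussed, below is a minimal sketch of a FISTA-style accelerated proximal gradient iteration applied to a Lasso-type instance of (1). The quadratic loss, the $\ell_1$ regularizer, the fixed stepsize $1/L_f$, and all names here (soft_threshold, fista, the toy data) are illustrative assumptions, not the setup used in the paper.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal map of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fista(grad_f, prox_g, L_f, x0, num_iters=500):
    """FISTA-style accelerated proximal gradient for F = f + g (the N = 1 case).

    grad_f : callable returning the gradient of the smooth part f
    prox_g : callable (v, stepsize) -> proximal map of g with that stepsize
    L_f    : Lipschitz constant of grad_f (the stepsize used is 1 / L_f)
    """
    x_prev = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(num_iters):
        # Proximal gradient step at the extrapolated point y.
        x = prox_g(y - grad_f(y) / L_f, 1.0 / L_f)
        # Nesterov momentum update.
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)
        x_prev, t = x, t_next
    return x_prev

# Toy example: f(x) = 0.5 * ||Ax - b||^2, g(x) = lam * ||x||_1 (illustrative data).
rng = np.random.default_rng(0)
A, b, lam = rng.standard_normal((40, 100)), rng.standard_normal(40), 0.1
L_f = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the gradient A^T(Ax - b)
x_hat = fista(lambda x: A.T @ (A @ x - b),
              lambda v, tau: soft_threshold(v, lam * tau),
              L_f, np.zeros(100))
```

For $N \ge 2$, the difficulty is that the proximal map of the sum $g = \sum_i \alpha_i g_i$ is generally not available in closed form even when each individual $\mathrm{Prox}_{g_i}$ is cheap, which motivates the approximation schemes discussed next.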
When $N$ is larger, one feasible approach is the subgradient-type method (Nemirovsky, Yudin, and Dawson 1982; Polyak 1977), which only attains the extremely slow $O(1/\sqrt{K})$ iteration complexity. To escape this dilemma, Nesterov proposed the smoothed accelerated proximal gradient (S-APG) method (Nesterov 2005a; 2005b) for nonsmooth minimization involving multiple nonsmooth terms. For S-APG to achieve the $O(1/K)$ iteration complexity, the smoothing parameter must be taken as small as $O(1/K)$. However, a small smoothing parameter forces a small iteration stepsize, which hurts practical performance. To make S-APG more appealing, several adaptive smoothing algorithms (Boţ and Hendrich 2015; Tran-Dinh 2015) were proposed based on Nesterov's smoothing technique. Another drawback, however, is that smoothing may destroy the important structure carried by the nonsmooth terms. Moreover, the approximation error grows linearly with $N$. To address this issue, Yu (Yu 2013) utilized the proximal average approximation technique, which exploits the proximal mapping of each nonsmooth term to reduce the approximation error, and proposed the proximal average based accelerated proximal gradient (PA-APG) method, which shows promising performance compared with S-APG (Nesterov 2005b). However, for PA-APG to enjoy the $O(1/K)$ iteration complexity, it suffers from the same issue: the approximation parameter must be as small as $O(1/K)$. This makes PA-APG impractical when higher-precision optimization is required, as demonstrated in our experiments.

To tackle this difficulty, we combine an easy-to-use adaptive approximation technique with the proximal average approximation technique and propose an Adaptive Proximal Average based Accelerated Proximal Gradient (APA-APG) method, which still enjoys the $O(1/K)$ iteration complexity without any extra computational cost. In contrast, Yu's PA-APG method (Yu 2013) requires many more iterations to converge. It should be emphasized that this combination is nontrivial and highly effective in enhancing optimization performance. To establish the $O(1/K)$ iteration complexity, we first derive the dual formulation of a proximal average approximation function using convex analysis techniques (Rockafellar 2015), and then leverage this dual formulation together with Danskin's min-max theory (Bertsekas 1999, Proposition B.25) to obtain a much tighter lower bound on the proximal average approximation function, which is crucial to proving the $O(1/K)$ complexity of the APA-APG method. Finally, we evaluate the proposed APA-APG method on overlapping group Lasso and graph-guided fused Lasso tasks. All experimental results indicate that our APA-APG method is about five times faster than the PA-APG method in (Yu 2013).

The rest of this paper is organized as follows. In Section 2 we give the definition of a proximal average function, derive its dual formulation, and establish a much stronger property of the proximal average function. In Section 3 we present the proposed APA-APG method along with its two variants, denoted APA-APG1 and APA-APG2, and prove its $O(1/K)$ iteration complexity.
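To make the proximal-average idea above concrete, here is a hedged sketch of an accelerated proximal gradient loop in which the proximal step on $g$ is replaced by the $\alpha$-weighted average of the component proximal maps (the key property of the proximal average; see the next section), while the approximation parameter is shrunk adaptively over iterations. The schedule $\gamma_k = \gamma_0/(k+1)$, the coupling of the stepsize to $\gamma_k$, and all names are illustrative assumptions; this is not the paper's actual APA-APG update rule.

```python
import numpy as np

def pa_apg_sketch(grad_f, proxes, alphas, L_f, x0, num_iters=500, gamma0=None):
    """Sketch of a proximal-average-based accelerated proximal gradient loop.

    grad_f : callable returning the gradient of the smooth part f
    proxes : list of callables, proxes[i](v, gamma) = Prox_{g_i}^{gamma}(v)
    alphas : weights alpha_i >= 0 summing to one
    The adaptive schedule gamma_k = gamma0 / (k + 1) is an illustrative
    assumption, not the schedule analyzed in the paper.
    """
    gamma0 = 1.0 / L_f if gamma0 is None else gamma0
    x_prev, y, t = x0.copy(), x0.copy(), 1.0
    for k in range(num_iters):
        gamma = min(gamma0 / (k + 1), 1.0 / L_f)   # adaptive approximation parameter
        v = y - gamma * grad_f(y)                  # forward (gradient) step
        # Proximal step on the proximal average: the prox of g^gamma is the
        # alpha-weighted average of the component proxes (Bauschke et al. 2008).
        x = sum(a * prox(v, gamma) for a, prox in zip(alphas, proxes))
        # Nesterov momentum update.
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)
        x_prev, t = x, t_next
    return x_prev
```

Holding $\gamma_k$ fixed at a value of order $1/K$ recovers the PA-APG setting described above; the point of an adaptive parameter is to let early iterations take much larger steps while the approximation error is still driven to zero.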
In Section 4 we conduct experiments to evaluate APA-APG. Finally, we draw our conclusions in Section 5.

In the rest of the paper, we use the notation $A := (\alpha_1, \cdots, \alpha_N)$, $B := \mathrm{Diag}(\alpha_1, \cdots, \alpha_N)$ and $C := B - A^*A$. Let $I_N$ denote the identity matrix of $\mathbb{R}^{N \times N}$. It is easy to check that $0 \preceq C \preceq I_N$ (indeed, for any $u \in \mathbb{R}^N$, $u^{\top} C u = \sum_{i=1}^N \alpha_i u_i^2 - (\sum_{i=1}^N \alpha_i u_i)^2 \in [0, \|u\|^2]$, since the $\alpha_i$ lie in $[0,1]$ and sum to one). Let $Q = Q_1 \times \cdots \times Q_N$ be the Cartesian product of the $Q_i$ for $i = 1, \cdots, N$. We write $\mathrm{Prox}_g^{\gamma}(x) := \arg\min_y \, g(y) + \frac{1}{2\gamma}\|y - x\|^2$ for the proximal mapping of $g$ with parameter $\gamma > 0$. Let $\mathrm{Diam}(Q) := \max_{x \in Q} \|x\|$ be the diameter of the set $Q$, and $C^{1/2}Q := \{C^{1/2}u \mid u \in Q\}$.

Proximal Average Function and Its Dual Formulation

In this section, we first present the definition of a proximal average function and then give its equivalent dual formulation, which is vital for establishing the main results. More properties and applications of proximal average functions can be found in (Bauschke et al. 2008; Hare 2009; Yu et al. 2015; Zhong and Kwok 2014a; 2014b).

Definition 1. (Bauschke et al. 2008, Definition 4.1) The proximal average function of $g(x) = \sum_{i=1}^N \alpha_i g_i(x)$ with parameter $\gamma > 0$ is defined via the following optimization problem
$$ g^{\gamma}(x) := \inf_{y_i} \Big\{ \sum_{i=1}^N \alpha_i g_i(y_i) + \frac{1}{2\gamma} \sum_{i=1}^N \alpha_i \|y_i\|^2 - \frac{1}{2\gamma} \|x\|^2 \;:\; \sum_{i=1}^N \alpha_i y_i = x \Big\}. $$
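One property of the proximal average established in the works cited above (Bauschke et al. 2008; Yu 2013), and the reason it can exploit the proximal mapping of each nonsmooth term so cheaply, is that its proximal mapping equals the $\alpha$-weighted average of the component proximal mappings:
$$ \mathrm{Prox}_{g^{\gamma}}^{\gamma}(x) \;=\; \sum_{i=1}^{N} \alpha_i \, \mathrm{Prox}_{g_i}^{\gamma}(x), $$
so one evaluation of the proximal map of $g^{\gamma}$ costs no more than $N$ individual proximal maps $\mathrm{Prox}_{g_i}^{\gamma}$.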
Similar articles
Accelerated Stochastic Gradient Method for Composite Regularization
Regularized risk minimization often involves nonsmooth optimization. This can be particularly challenging when the regularizer is a sum of simpler regularizers, as in the overlapping group lasso. Very recently, this is alleviated by using the proximal average, in which an implicitly nonsmooth function is employed to approximate the composite regularizer. In this paper, we propose a novel extens...
Joint Filter and Waveform Design for Radar STAP in Signal Dependent Interference (preprint)
Waveform design is a pivotal component of the fully adaptive radar construct. In this paper we consider waveform design for radar space time adaptive processing (STAP), accounting for the waveform dependence of the clutter correlation matrix. Due to this dependence, in general, the joint problem of receiver filter optimization and radar waveform design becomes an intractable, non-convex optimiz...
The proximal point method revisited
In this short survey, I revisit the role of the proximal point method in large scale optimization. I focus on three recent examples: a proximally guided subgradient method for weakly convex stochastic approximation, the prox-linear algorithm for minimizing compositions of convex functions and smooth maps, and Catalyst generic acceleration for regularized Empirical Risk Minimization.
Forward-Backward Truncated Newton Methods for Convex Composite Optimization
This paper proposes two proximal Newton-CG methods for convex nonsmooth optimization problems in composite form. The algorithms are based on a reformulation of the original nonsmooth problem as the unconstrained minimization of a continuously differentiable function, namely the forward-backward envelope (FBE). The first algorithm is based on a standard line search strategy, whereas the second...
An Accelerated Proximal Coordinate Gradient Method
We develop an accelerated randomized proximal coordinate gradient (APCG) method, for solving a broad class of composite convex optimization problems. In particular, our method achieves faster linear convergence rates for minimizing strongly convex functions than existing randomized proximal coordinate gradient methods. We show how to apply the APCG method to solve the dual of the regularized em...